Packing with partial samples#239

Open
voegtlel wants to merge 5 commits into develop from feature/packing_overflow_partial

voegtlel commented May 29, 2026

Collaborator

Summary

This PR is based on #228.

  • Add PartialSample for packing-time carryover of user-defined sample slices.
  • Preserve partial restore keys through pushback, postencoding, final packing, checkpoint restore, and random-access restore_sample.
  • Document advanced partial packing, where each slice is a tuple[int, int] that follows Python slicing semantics (start inclusive, stop exclusive).

Example:

class MyTaskEncoder(TaskEncoder):
    def select_samples_to_pack(
        self,
        samples: list[TokenizedSample | PartialSample[TokenizedSample, tuple[int, int]]],
    ) -> PackedSamplesOutput[TokenizedSample | PartialSample[TokenizedSample, tuple[int, int]]]:
        sample = samples[0]
        if isinstance(sample, PartialSample):
            # A remainder pushed back from an earlier pack: continue from its slice.
            base_sample = sample.sample
            start, stop = sample.slice
        else:
            # A fresh sample: the slice covers all of its tokens.
            base_sample = sample
            start, stop = 0, len(sample.tokens)
        # self.remaining_context is not defined in this snippet; it stands for
        # the number of tokens that still fit into the current pack.
        fit_stop = min(stop, start + self.remaining_context)
        return PackedSamplesOutput(
            packs=[[PartialSample(sample=base_sample, slice=(start, fit_stop))]],
            # Push back whatever did not fit, so a later pack can pick it up.
            pushback=(
                [PartialSample(sample=base_sample, slice=(fit_stop, stop))]
                if fit_stop < stop
                else []
            ),
        )

    @stateless
    def postencode_sample(
        self,
        sample: TokenizedSample | PartialSample[TokenizedSample, tuple[int, int]],
    ) -> TokenizedSample:
        if not isinstance(sample, PartialSample):
            return sample
        # Materialize the slice: cut tokens and labels to [start, stop).
        start, stop = sample.slice
        base = sample.sample
        return TokenizedSample.derive_from(
            base,
            tokens=base.tokens[start:stop],
            labels=base.labels[start:stop],
        )
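
The slice arithmetic in the example above can be checked on its own, without the library. The following is a minimal sketch in plain Python: `PartialSampleSketch`, `split_to_fit`, and `pack_all` are hypothetical names invented for illustration and are not part of this PR, and the `context` argument stands in for whatever per-pack budget the task encoder tracks. It only shows how a `(start, stop)` tuple is split repeatedly until no remainder is pushed back.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class PartialSampleSketch(Generic[T]):
    """Hypothetical stand-in: a base sample plus a half-open (start, stop) slice."""

    sample: T
    slice: tuple[int, int]


def split_to_fit(tokens, start, stop, remaining):
    """Split [start, stop) into the part that fits `remaining` tokens and the leftover."""
    fit_stop = min(stop, start + remaining)
    fitted = PartialSampleSketch(tokens, (start, fit_stop))
    # Mirror the pushback rule: only keep a leftover if something did not fit.
    leftover = PartialSampleSketch(tokens, (fit_stop, stop)) if fit_stop < stop else None
    return fitted, leftover


def pack_all(tokens, context):
    """Split repeatedly until nothing is pushed back; return the materialized chunks."""
    if context <= 0:
        raise ValueError("context must be positive, otherwise no progress is made")
    chunks = []
    start, stop = 0, len(tokens)
    while True:
        fitted, leftover = split_to_fit(tokens, start, stop, context)
        s, e = fitted.slice
        chunks.append(fitted.sample[s:e])  # same slicing as tokens[start:stop]
        if leftover is None:
            return chunks
        start, stop = leftover.slice


tokens = list(range(10))
chunks = pack_all(tokens, context=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Because the slices are half-open, each remainder starts exactly where the fitted part stops, so concatenating the chunks reproduces the original token list with no gaps or overlaps.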

voegtlel added 5 commits May 5, 2026 09:55
Introduce PartialSample support in PackingDataset so pack selection can carry sliced remainders through pushback, postencode, final packing, and restore. Document the advanced partial-packing contract and cover both postencode and final-packer slicing paths.